Exploring the Data

This notebook explores interesting trends, relationships and other features in the data.

Airline IATA Code Data

Bar Charts

This section visually examines relationships between individual attributes.

ArrDelay vs AirTime

The bar chart shows the proportion of flights that are delayed according to the average length of a given flight, looking for a possible relation between time spent in the air and arrival delay. From the chart below, it seems that (down to around 20 minutes) there is some inverse relation between time spent in the air and the number of flights delayed.

Arrival Delay against Departure Delay

Here we look at whether a flight that departs late also arrives late. The intuition is that lost time cannot be 'taken back' while flying (although a flight can make up some time by flying with the wind). From the graph, it seems that there is a fair linear correlation between arrival delay and departure delay (in minutes). The lower range shows a significant increase in arrival delay for small departure delays; this may be investigated further.
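As a rough numerical check on the linear relationship described above, the Pearson correlation and a least-squares line can be computed directly. This is a minimal sketch on made-up values, not the notebook's data; the column name DepDelay is an assumption (ArrDelay appears in a heading above).

```python
import numpy as np
import pandas as pd

# Toy values standing in for the flights table (not real data).
# "DepDelay" is an assumed column name; both columns are in minutes.
flights = pd.DataFrame({
    "DepDelay": [0, 5, 10, 20, 40, 60],
    "ArrDelay": [-3, 2, 9, 18, 41, 58],
})

# Pearson correlation coefficient between departure and arrival delay.
r = flights["DepDelay"].corr(flights["ArrDelay"])

# Least-squares line: ArrDelay ~ slope * DepDelay + intercept.
slope, intercept = np.polyfit(flights["DepDelay"], flights["ArrDelay"], deg=1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A slope close to 1 would be consistent with the intuition that little of a departure delay is recovered in the air.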

Arrival Delay against Arrival Time

This section looks for a correlation between whether a flight is delayed and its actual time of arrival. The theory is that flights may be delayed more than usual at certain times of day due to differences in traffic, etc. However, from visual inspection of the graph below, there does not seem to be any significant linear relation between arrival delay and arrival time. Additionally, the variance of arrival delay does not seem to be constant across arrival times.

Difference Between CRS (Scheduled) Times and Actual Time

This section aims to visualise how strongly (or weakly) CRS times correlate with actual arrival and departure times. This may give a sense of how much variance actual arrival and departure times exhibit with respect to scheduled (CRS) times. Red-coloured points are flights delayed by more than 15 minutes.

Actual Elapsed Time vs Estimated Elapsed Time

Looking for Correlations

First, we look at the relationship between airline and arrival delay by charting the number of delayed flights as a percentage of all flights flown by each airline. This controls for the fact that an airline operating more flights has more opportunities for a delay, which would otherwise leave larger, busier airlines with larger absolute numbers of delayed flights.
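The normalisation described above might be sketched as follows. The data is invented for illustration, the Airline column name is an assumption, and treating an arrival delay of more than 15 minutes as "delayed" is borrowed from the threshold used elsewhere in this notebook rather than confirmed for this chart.

```python
import pandas as pd

# Invented example data: "AA" flies more flights than "B6".
# "Airline" is an assumed column name; ArrDelay is in minutes.
flights = pd.DataFrame({
    "Airline": ["AA", "AA", "AA", "AA", "AA", "AA", "B6", "B6"],
    "ArrDelay": [30, 0, 5, 10, 20, -4, 45, 2],
})

# Assumption: a flight counts as delayed if it arrives more than 15 minutes late.
flights["Delayed"] = flights["ArrDelay"] > 15

# Count delayed and total flights per airline, then express as a percentage.
summary = flights.groupby("Airline")["Delayed"].agg(
    delayed_flights="sum",
    total_flights="count",
)
summary["delayed_pct"] = 100 * summary["delayed_flights"] / summary["total_flights"]
summary = summary.sort_values("delayed_pct", ascending=False)

print(summary)
```

In this toy table "AA" has more delayed flights in absolute terms, but "B6" has the higher delay percentage, which is the distortion the percentage view is meant to remove.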

Top-10 Most Delayed Airlines

Zeroing in on the above, Piedmont Aviation, Ukraine International Airways and JetBlue Airways seem to have the highest flight delays.

Day of Week

Here we explore whether certain days are more prone to flight delays. From the second bar chart, it seems that the chance of a flight being delayed is higher in the middle of the week.

Arrival Delay By Time-of-Year in Weeks

Visualising Number of Arrival Delays over Time

Predicting Chance of an Arrival Delay

To encode: Dest DestState Origin OriginState
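One possible way to encode these four categorical columns is one-hot encoding with pandas. This is only a sketch: the notebook does not state which encoding it uses, the rows are invented, and DepDelay is an assumed name for a numeric column that is left untouched.

```python
import pandas as pd

# Invented rows; the four categorical column names come from the note above.
flights = pd.DataFrame({
    "Origin": ["JFK", "LAX", "JFK"],
    "OriginState": ["NY", "CA", "NY"],
    "Dest": ["LAX", "JFK", "ORD"],
    "DestState": ["CA", "NY", "IL"],
    "DepDelay": [12, 0, 47],  # assumed numeric column, passed through unchanged
})

categorical = ["Dest", "DestState", "Origin", "OriginState"]

# Replace each categorical column with one indicator column per distinct value.
encoded = pd.get_dummies(flights, columns=categorical)

print(encoded.columns.tolist())
```

With many distinct airports, one-hot encoding produces a wide table, so a more compact encoding may be preferable in practice.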

Feature Selection

Looking for Multi-Collinearity

The Variance Inflation Factor (VIF) regresses each input feature on all of the other features. Typically, features with VIF values above 5 or 10 should be excluded from the feature set, since such values generally imply a strong linear relationship between the feature in question and the other features.
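For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on the remaining features. The sketch below computes it with plain NumPy on synthetic data; the helper name vif_table and the Distance column are hypothetical, and the notebook itself may use a library routine such as statsmodels' variance_inflation_factor instead.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: VIF per column, via regression on the other columns."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so R^2 is measured around the mean of y.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)

# Synthetic data: AirTime is built to be nearly a linear function of Distance,
# while TaxiOut is drawn independently of both.
rng = np.random.default_rng(0)
distance = rng.uniform(100, 2500, size=500)
demo = pd.DataFrame({
    "Distance": distance,  # assumed column name
    "AirTime": distance / 8 + rng.normal(0, 5, size=500),
    "TaxiOut": rng.uniform(5, 40, size=500),
})

vifs = vif_table(demo)
print(vifs)
```

On this synthetic table the two collinear columns get large VIFs and the independent one stays near 1, which is the pattern used to pick candidates for dropping.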

From the below, AirTime, DistanceGroup, Number of Flights and TaxiOut time are potential candidates for dropping.

Rebuilding the model with the above features excluded

From the above, it seems that the most important indicator of a flight arriving late is whether it has a departure delay greater than 15 minutes.

Removing Highest Predictors

Random Forest Training (external file)

see Machine Learning File